CMSC320 Final Project

Motivation

Monitoring and tracking a city’s crime records is important: the records show how and when crimes are committed, and they give authorities a better way to analyze patterns of crime across the city and to direct security resources where they reduce crime most efficiently. In this tutorial, we therefore walk through an analysis of a San Francisco crime data set using the data science methods we learned in CMSC320.

Introduction

In this project, we use a dataset published on Kaggle by Roshan Sharma to show how to carry out the analysis. The tutorial is split into three parts:

  • Evaluating data basing on single attributes

  • Attribute analysis with application of Interactive maps

  • Prediction and regression analysis

Through this tutorial, we aim not only to show how to analyze a dataset with data science methods, but also to suggest, based on our analysis, what authorities could do to reduce crime.

Data Preparation

First, we need to load some libraries needed for our project:

library(rvest)
library(tidyverse)
library(tidyr)
library(lubridate)
library(dplyr) 
library(leaflet)
library(stringi)
library(broom)
library(tree)

We download the data from Roshan Sharma’s Kaggle page and load it with read.csv:

data <- read.csv("Police_Department_Incidents_-_Previous_Year__2016_.csv")
head(data,10)
##    IncidntNum       Category                                       Descript
## 1   120058272    WEAPON LAWS                      POSS OF PROHIBITED WEAPON
## 2   120058272    WEAPON LAWS FIREARM, LOADED, IN VEHICLE, POSSESSION OR USE
## 3   141059263       WARRANTS                                 WARRANT ARREST
## 4   160013662   NON-CRIMINAL                                  LOST PROPERTY
## 5   160002740   NON-CRIMINAL                                  LOST PROPERTY
## 6   160002869        ASSAULT                                        BATTERY
## 7   160003130 OTHER OFFENSES                               PAROLE VIOLATION
## 8   160003259   NON-CRIMINAL                                    FIRE REPORT
## 9   160003970       WARRANTS                                 WARRANT ARREST
## 10  160003641 MISSING PERSON                                   FOUND PERSON
##    DayOfWeek                   Date  Time PdDistrict     Resolution
## 1     Friday 01/29/2016 12:00:00 AM 11:00   SOUTHERN ARREST, BOOKED
## 2     Friday 01/29/2016 12:00:00 AM 11:00   SOUTHERN ARREST, BOOKED
## 3     Monday 04/25/2016 12:00:00 AM 14:59    BAYVIEW ARREST, BOOKED
## 4    Tuesday 01/05/2016 12:00:00 AM 23:50 TENDERLOIN           NONE
## 5     Friday 01/01/2016 12:00:00 AM 00:30    MISSION           NONE
## 6     Friday 01/01/2016 12:00:00 AM 21:35   NORTHERN           NONE
## 7   Saturday 01/02/2016 12:00:00 AM 00:04   SOUTHERN ARREST, BOOKED
## 8   Saturday 01/02/2016 12:00:00 AM 01:02 TENDERLOIN           NONE
## 9   Saturday 01/02/2016 12:00:00 AM 12:21   SOUTHERN ARREST, BOOKED
## 10    Friday 01/01/2016 12:00:00 AM 10:06    BAYVIEW           NONE
##                    Address         X        Y
## 1   800 Block of BRYANT ST -122.4034 37.77542
## 2   800 Block of BRYANT ST -122.4034 37.77542
## 3    KEITH ST / SHAFTER AV -122.3889 37.72998
## 4   JONES ST / OFARRELL ST -122.4130 37.78579
## 5     16TH ST / MISSION ST -122.4197 37.76505
## 6    1700 Block of BUSH ST -122.4261 37.78802
## 7      MARY ST / HOWARD ST -122.4057 37.78088
## 8     200 Block of EDDY ST -122.4118 37.78398
## 9        4TH ST / BERRY ST -122.3934 37.77579
## 10 100 Block of CAMERON WY -122.3872 37.72097
##                                 Location         PdId
## 1   (37.775420706711, -122.403404791479) 1.200583e+13
## 2   (37.775420706711, -122.403404791479) 1.200583e+13
## 3  (37.7299809672996, -122.388856204292) 1.410593e+13
## 4  (37.7857883766888, -122.412970537591) 1.600137e+13
## 5  (37.7650501214668, -122.419671780296) 1.600027e+13
## 6   (37.788018555829, -122.426077177375) 1.600029e+13
## 7  (37.7808789360214, -122.405721454567) 1.600031e+13
## 8  (37.7839805592634, -122.411778295992) 1.600033e+13
## 9  (37.7757876218293, -122.393357241451) 1.600040e+13
## 10 (37.7209669615499, -122.387181635995) 1.600036e+13

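The table below lists 12 of the dataset’s attributes, with data types and descriptions taken from the Kaggle page. (The head() output above also shows a thirteenth column, Descript, which holds a free-text description of each incident.) Note that X holds the longitude and Y the latitude, as the values around -122.4 and 37.8 show.

| Num | Name       | Type                  | Description                                                 |
|-----|------------|-----------------------|-------------------------------------------------------------|
| 1   | IncidntNum | categorical           | Incident Number                                             |
| 2   | Category   | categorical unordered | Description of Crime                                        |
| 3   | DayOfWeek  | categorical unordered | Day of Week when the crime happened                         |
| 4   | Date       | Datetime              | Date                                                        |
| 5   | Time       | Datetime              | Time                                                        |
| 6   | PdDistrict | categorical unordered | District                                                    |
| 7   | Resolution | categorical unordered | Kind of Punishment given to the criminal to resolve the case |
| 8   | Address    | Geolocation           | Address where the crime happened                            |
| 9   | X          | Geolocation           | Longitude of the crime location                             |
| 10  | Y          | Geolocation           | Latitude of the crime location                              |
| 11  | Location   | Geolocation           | Exact location, as a (latitude, longitude) pair             |
| 12  | PdId       | other                 | Pd Id                                                       |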

Let’s tidy the data:

  • First, we drop the last attribute, PdId, because it is not useful for our analysis.

  • Second, we clean up the dates. As you can see, the time part of the Date attribute is always 12:00:00 AM, so we strip it off and convert the Date and Time attributes to proper date-time types.

  • Third, for easier comparison, we extract the numeric month from Date and the numeric hour from Time.

tidy <- data %>%
  mutate(Time = hm(Time))%>%
  mutate(hour = hour(Time)) %>%
  mutate(Date = mdy_hms(Date)) %>%
  mutate(Month = format(Date, "%m")) %>%
  select(-PdId)
head(tidy)
##   IncidntNum     Category                                       Descript
## 1  120058272  WEAPON LAWS                      POSS OF PROHIBITED WEAPON
## 2  120058272  WEAPON LAWS FIREARM, LOADED, IN VEHICLE, POSSESSION OR USE
## 3  141059263     WARRANTS                                 WARRANT ARREST
## 4  160013662 NON-CRIMINAL                                  LOST PROPERTY
## 5  160002740 NON-CRIMINAL                                  LOST PROPERTY
## 6  160002869      ASSAULT                                        BATTERY
##   DayOfWeek       Date       Time PdDistrict     Resolution
## 1    Friday 2016-01-29  11H 0M 0S   SOUTHERN ARREST, BOOKED
## 2    Friday 2016-01-29  11H 0M 0S   SOUTHERN ARREST, BOOKED
## 3    Monday 2016-04-25 14H 59M 0S    BAYVIEW ARREST, BOOKED
## 4   Tuesday 2016-01-05 23H 50M 0S TENDERLOIN           NONE
## 5    Friday 2016-01-01     30M 0S    MISSION           NONE
## 6    Friday 2016-01-01 21H 35M 0S   NORTHERN           NONE
##                  Address         X        Y
## 1 800 Block of BRYANT ST -122.4034 37.77542
## 2 800 Block of BRYANT ST -122.4034 37.77542
## 3  KEITH ST / SHAFTER AV -122.3889 37.72998
## 4 JONES ST / OFARRELL ST -122.4130 37.78579
## 5   16TH ST / MISSION ST -122.4197 37.76505
## 6  1700 Block of BUSH ST -122.4261 37.78802
##                                Location hour Month
## 1  (37.775420706711, -122.403404791479)   11    01
## 2  (37.775420706711, -122.403404791479)   11    01
## 3 (37.7299809672996, -122.388856204292)   14    04
## 4 (37.7857883766888, -122.412970537591)   23    01
## 5 (37.7650501214668, -122.419671780296)    0    01
## 6  (37.788018555829, -122.426077177375)   21    01
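
As a quick cross-check of what the tidying step extracts, the same month and hour values can be derived for two of the rows shown above. This is only a sketch in base R, not the lubridate calls used in the tutorial; the variable names (raw_date, raw_time, date_only, month_chr, hour_int) are made up for the illustration.

```r
# Two raw values copied from the head(data) output above
raw_date <- c("01/29/2016 12:00:00 AM", "04/25/2016 12:00:00 AM")
raw_time <- c("11:00", "14:59")

# Parsing stops once the format is consumed, so the constant
# "12:00:00 AM" suffix is simply ignored
date_only <- as.Date(raw_date, format = "%m/%d/%Y")

# Month as a zero-padded string (like format(Date, "%m") in the pipeline),
# hour as an integer taken from the first two characters of "HH:MM"
month_chr <- format(date_only, "%m")
hour_int <- as.integer(substr(raw_time, 1, 2))

print(data.frame(date_only, month_chr, hour_int))
```

The values should agree with the Date, Month and hour columns of rows 1 and 3 in the tidy output.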

Evaluation on Single Attribute

  • First, let’s look at the distribution of the number of crimes across the months of 2016. The differences between months are not large.
table(tidy$Month)
## 
##    01    02    03    04    05    06    07    08    09    10    11    12 
## 12946 12092 12362 12317 12713 12076 12166 12428 12473 13331 12670 12926

We use a bar graph to visualize how the number of crimes differs from month to month. In the graph below, the bars do not differ much in height, but the bars for January and October are somewhat taller than the others, which means that the number of crimes in January and October is relatively higher than in other months. Therefore, we could suggest that authorities add security and shifts especially in January and October.

tidy %>%
  group_by(Month)%>%
  summarize(num_incident = n()) %>%
  ggplot(mapping = aes(x = Month, y = num_incident)) + geom_bar(stat = "identity")
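
To put a number on how small the month-to-month differences are, one option is to express each monthly count as a percentage deviation from the monthly mean. The sketch below re-enters the counts from the table above by hand; the names month_counts, monthly_mean and pct_from_mean are ours, not part of the tutorial code.

```r
# Monthly incident counts copied from table(tidy$Month) above
month_counts <- c(
  "01" = 12946, "02" = 12092, "03" = 12362, "04" = 12317,
  "05" = 12713, "06" = 12076, "07" = 12166, "08" = 12428,
  "09" = 12473, "10" = 13331, "11" = 12670, "12" = 12926
)

# Average number of incidents per month
monthly_mean <- mean(month_counts)

# Percentage deviation of each month from that average
pct_from_mean <- round(100 * (month_counts - monthly_mean) / monthly_mean, 1)

# Largest positive deviations first
print(sort(pct_from_mean, decreasing = TRUE))
```

October comes out about 6% above the monthly mean, and every other month lies closer to the mean than that, which is consistent with the nearly level bars.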

  • Second, let’s look at the distribution of crimes over the day (by hour):
table(tidy$hour)
## 
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 6941 4359 3494 2553 1885 1744 2518 3894 5575 5865 6483 6786 9021 7268 7621 8329 
##   16   17   18   19   20   21   22   23 
## 8656 9559 9718 8981 8098 7480 7099 6573

We can also use a bar graph to see the distribution of all crimes by hour. As you can see from the table above and the plot below, in 2016 the fewest crimes were committed between 05:00 and 05:59 and the most between 18:00 and 18:59. Comparing the hourly counts with the average of about 6,271 incidents per hour, the hours from 1:00 to 9:59 all fall below average, and the counts stay high from noon until just after midnight. Therefore, police could schedule more security checks and shifts around the city between 12:00 and 00:59.

tidy %>%
  group_by(hour) %>% 
    summarize(num_incident = n()) %>% 
  ggplot(mapping = (aes(x = hour, y = num_incident))) + geom_boxplot() + geom_bar(stat = "identity")
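
The comparison with the hourly average can be checked directly from the printed table. The following sketch re-enters the 24 counts by hand (hour_counts, hourly_mean and below_mean are illustrative names) and lists the hours that fall below the mean.

```r
# Hourly incident counts copied from table(tidy$hour) above (hours 0 to 23)
hour_counts <- c(6941, 4359, 3494, 2553, 1885, 1744, 2518, 3894,
                 5575, 5865, 6483, 6786, 9021, 7268, 7621, 8329,
                 8656, 9559, 9718, 8981, 8098, 7480, 7099, 6573)
names(hour_counts) <- 0:23

# Average number of incidents per hour of the day
hourly_mean <- mean(hour_counts)

# Hours whose count is below that average
below_mean <- names(hour_counts)[hour_counts < hourly_mean]

cat("Hourly mean:", round(hourly_mean), "\n")
cat("Hours below the mean:", below_mean, "\n")
cat("Quietest hour:", names(which.min(hour_counts)),
    " Busiest hour:", names(which.max(hour_counts)), "\n")
```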

  • Third, let’s look at how crimes were distributed over the days of the week in 2016:
sort(table(tidy$DayOfWeek))
## 
##    Sunday    Monday   Tuesday Wednesday  Thursday  Saturday    Friday 
##     20205     20783     21242     21332     21395     22172     23371

From the table, there is not much difference between the days of the week. The three days with the most crimes are Friday, Saturday and Thursday. We use a bar graph to visualize the data. Based on the data, we could suggest that authorities add security and shifts around the weekend (starting on Friday).

tidy %>%
  group_by(DayOfWeek) %>% 
    summarize(num_incident = n()) %>% 
  ggplot(mapping = (aes(x = DayOfWeek, y = num_incident))) + geom_boxplot() + geom_bar(stat = "identity")
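
One detail worth knowing when plotting days of the week: ggplot2 orders a discrete axis by factor levels, which for plain strings means alphabetical order, so the bars do not appear in calendar order by default. A possible remedy, sketched here on the counts from the table above with made-up names (day_counts, week_order), is to convert DayOfWeek to a factor with explicit levels before plotting.

```r
# Counts copied from sort(table(tidy$DayOfWeek)) above
day_counts <- data.frame(
  DayOfWeek = c("Sunday", "Monday", "Tuesday", "Wednesday",
                "Thursday", "Saturday", "Friday"),
  num_incident = c(20205, 20783, 21242, 21332, 21395, 22172, 23371)
)

# Calendar order, starting on Monday (an arbitrary choice for this sketch)
week_order <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

# A factor with explicit levels carries the ordering with it
day_counts$DayOfWeek <- factor(day_counts$DayOfWeek, levels = week_order)
day_counts <- day_counts[order(day_counts$DayOfWeek), ]

print(day_counts)
```

Applying the same factor() call inside a mutate() before group_by(DayOfWeek) should give a bar chart that reads Monday through Sunday.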

  • Finally, let’s look at how crimes are distributed across categories:
sort(table(tidy$Category))
## 
##                        TREA     PORNOGRAPHY/OBSCENE MAT 
##                           3                           4 
##                    GAMBLING                  BAD CHECKS 
##                          20                          34 
##  SEX OFFENSES, NON FORCIBLE                   LOITERING 
##                          40                          42 
##             FAMILY OFFENSES                   EXTORTION 
##                          53                          60 
##                     BRIBERY                     SUICIDE 
##                          66                          69 
##                     RUNAWAY                 LIQUOR LAWS 
##                         140                         156 
##                EMBEZZLEMENT                  KIDNAPPING 
##                         168                         257 
##                       ARSON DRIVING UNDER THE INFLUENCE 
##                         286                         378 
##                 DRUNKENNESS      FORGERY/COUNTERFEITING 
##                         465                         619 
##                PROSTITUTION          DISORDERLY CONDUCT 
##                         641                         658 
##           RECOVERED VEHICLE             STOLEN PROPERTY 
##                         736                         882 
##      SEX OFFENSES, FORCIBLE                 WEAPON LAWS 
##                         940                        1658 
##                    TRESPASS             SECONDARY CODES 
##                        1812                        1841 
##                       FRAUD                     ROBBERY 
##                        2635                        3299 
##               DRUG/NARCOTIC              MISSING PERSON 
##                        4243                        4338 
##              SUSPICIOUS OCC                    BURGLARY 
##                        5782                        5802 
##                    WARRANTS               VEHICLE THEFT 
##                        5914                        6419 
##                   VANDALISM                     ASSAULT 
##                        8589                       13577 
##                NON-CRIMINAL              OTHER OFFENSES 
##                       17866                       19599 
##               LARCENY/THEFT 
##                       40409

Looking at the table of categories, the three most common categories in San Francisco are LARCENY/THEFT, OTHER OFFENSES and NON-CRIMINAL. A good way to visualize this is a pie chart, which shows the proportion of each category and the differences among them. As you can see, the largest category, LARCENY/THEFT, accounts for more than a quarter of all incidents. Therefore, we could suggest that authorities put more effort into preventing larceny and theft.

tidy %>%
  group_by(Category) %>%
  summarize(num_incident = n()) %>% 
  ggplot(aes(x="", y=num_incident, fill=Category)) +
  geom_bar(stat="identity", width=1) +
  coord_polar("y", start=0)
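
The proportions read off the pie chart can also be computed from the printed counts. The sketch below takes the three largest categories from the table above, lumps everything else into a single "OTHER (all remaining)" slice (a label we made up for this illustration), and prints the shares in percent. The total of 150,500 incidents is the sum of the hourly table shown earlier.

```r
# Three largest categories, copied from sort(table(tidy$Category)) above
top_counts <- c("LARCENY/THEFT" = 40409,
                "OTHER OFFENSES" = 19599,
                "NON-CRIMINAL" = 17866)

# Total number of incidents (sum of the hourly table)
total_incidents <- 150500

# Everything outside the top three, lumped into one slice
lumped <- c(top_counts,
            "OTHER (all remaining)" = total_incidents - sum(top_counts))

# Share of each slice in percent
shares <- round(100 * lumped / total_incidents, 1)
print(shares)
```

With 39 categories the pie chart has many slices that are too thin to read, so lumping the small categories like this before plotting may make the figure easier to interpret.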

In this part, we showed you, step by step, an easy way to prepare, tidy and visualize the data. Working through these steps gives you a clear picture of how each attribute in the dataset is distributed.

Exploratory Data Analysis

A useful visualization for geographic data is an interactive map. Each incident has location coordinates, which let us see how crime incidents are distributed across San Francisco. In this section, we want to understand whether certain areas of San Francisco have a higher crime rate and, if so, whether certain times of day or days of the week have a higher crime rate.

First, we set the map view using the latitudes and longitudes of San Francisco:

map <- leaflet(tidy) %>%
  addTiles() %>%
  setView(lat=37.7740, lng=-122.4313, zoom=11)
map

The following tables show how the data set is color-coded in the interactive map:

| Color  | Incident Time |
|--------|---------------|
| yellow | 6am - 12pm    |
| navy   | 12pm - 6pm    |
| red    | 6pm - 12am    |
| black  | 12am - 6am    |

| Color  | Day of Week |
|--------|-------------|
| red    | Monday      |
| orange | Tuesday     |
| yellow | Wednesday   |
| green  | Thursday    |
| blue   | Friday      |
| navy   | Saturday    |
| purple | Sunday      |

Then, we need to set the elements to display the data, popup information, different colors for different time or day of week, and the icons.

This one is for Incident Time:

color <- function(tidy){
  sapply(tidy$hour, function(hour){
    if (as.integer(hour) >= 6 & as.integer(hour) < 12){
      "yellow"
    } else if (as.integer(hour) >=12 & as.integer(hour) < 18){
      "navy"
    } else if (as.integer(hour) >= 18){
      "red"
    } else {
      "black"
    }
  })
}
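
The nested if/else above maps each hour to a color one element at a time. As an alternative sketch, the same binning can be written in vectorized form with base R’s cut(); hour_to_color is a hypothetical helper name and is not used elsewhere in this tutorial.

```r
# Vectorized hour-to-color mapping with the same four bins as color():
# [0, 6) black, [6, 12) yellow, [12, 18) navy, [18, 24) red
hour_to_color <- function(hour) {
  bins <- cut(hour,
              breaks = c(0, 6, 12, 18, 24),
              labels = c("black", "yellow", "navy", "red"),
              right = FALSE)  # intervals are closed on the left
  as.character(bins)
}

# One value from each edge of every bin
print(hour_to_color(c(0, 5, 6, 11, 12, 17, 18, 23)))
```

Because cut() works on the whole vector at once, this avoids calling a function once per row, which can matter with roughly 150,000 incidents.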

icons <- awesomeIcons(
  icon = 'ios-close',
  iconColor = 'black',
  library = 'ion',
  markerColor = color(tidy)
)

label <- paste("<b>Day of Week: </b>", tidy$DayOfWeek, "<br>",
               "<b>Address: </b>", tidy$Address, "<br>",
               "<b>Category: </b>", tidy$Category, "<br>",
               "<b>Description: </b>", tidy$Descript, "<br>",
               "<b>Resolution: </b>", tidy$Resolution, "<br>")

We use markers to represent each entity in the samples that have different incident times.

map <- map %>%
  addAwesomeMarkers(
    data = tidy,
    lng = tidy$X,
    lat = tidy$Y,
    icon = icons,
    popup = label, 
    clusterOptions = markerClusterOptions(),
    group = 'time'
  ) %>%
  addLegend(position = "bottomright", colors = c("yellow", "navy", "red", "black"),
            labels = c("6am - 12pm", "12pm - 6pm", "6pm - 12am", "12am - 6am"), 
            title = "Different Incident Time", group = 'time')

This one is for Day of Week:

color2 <- function(tidy){
  sapply(tidy$DayOfWeek, function(DayOfWeek){
    if (stri_cmp(DayOfWeek, "Monday") == 0){
      "red"
    } else if (stri_cmp(DayOfWeek, "Tuesday") == 0){
      "orange"
    } else if (stri_cmp(DayOfWeek, "Wednesday") == 0){
      "yellow"
    } else if (stri_cmp(DayOfWeek, "Thursday") == 0){
      "green"
    } else if (stri_cmp(DayOfWeek, "Friday") == 0){
      "blue"
    } else if (stri_cmp(DayOfWeek, "Saturday") == 0){
      "navy"
    } else {
      "purple"
    }
  })
}
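
The chain of stri_cmp() comparisons can also be expressed as a lookup in a named character vector. This is a sketch with hypothetical names (day_colors, day_to_color). One difference from color2() is worth noting: an unrecognized day yields NA here instead of falling through to "purple".

```r
# Color for each day, matching the legend used in this tutorial
day_colors <- c(Monday = "red", Tuesday = "orange", Wednesday = "yellow",
                Thursday = "green", Friday = "blue", Saturday = "navy",
                Sunday = "purple")

# Look up colors by name; as.character() also handles factor input
day_to_color <- function(day) {
  unname(day_colors[as.character(day)])
}

print(day_to_color(c("Monday", "Friday", "Sunday")))
```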

We use circles to represent each entity in the samples that have a different incident day of the week.

map <- map %>%
  addCircleMarkers(
    data = tidy,
    lng = tidy$X,
    lat = tidy$Y,
    color = color2(tidy), 
    clusterOptions = markerClusterOptions(),
    group = 'day'
  ) %>%
  addLegend(position = "bottomleft", colors = c("red", "orange", "yellow", "green", "blue", "navy", "purple"), 
            labels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"),
            title = "Different Day of Week", group = 'day')

Then we combine these two maps together:

map <- map %>%
  addLayersControl(overlayGroups = c('time', 'day'), options = layersControlOptions(collapsed = FALSE)) %>%
  hideGroup("day")
map

Using the interactive map, we can see that crime is spread fairly evenly across San Francisco, except that the southwest of the city has a relatively lower crime rate. If we zoom in on the map, most of the incidents appear to have happened either at 6pm - 12am or at 12am - 6am. By contrast, we can hardly tell from the map which day of the week has a higher crime rate; the incidents look uniformly distributed across days. From the map, it therefore seems that most incidents in San Francisco happen between 6pm and 6am, regardless of the day of the week. This is a visual impression from the clustered markers and should be read together with the hourly table in the previous section, where counts are lowest in the early morning, with the minimum at 05:00 - 05:59.

Regression analysis

Regression is a powerful and commonly used approach to modeling the relationship between a scalar response and one or more predictors. We use it extensively in exploratory data analysis and in statistical analyses, since it fits into the statistical framework we saw in the last unit and thus lets us construct confidence intervals and test hypotheses about relationships between variables. It also provides predictions for continuous outcomes of interest.

For the first example, we want to check whether the resolution is related to the time of the incident (day of the week and hour).

Regression models need a numeric encoding of categorical variables, so we create a new dummy variable to encode the resolution. The new attribute reso is 0 if the resolution is “NONE” and 1 otherwise (for example, an arrest).

tidy_new2 <- tidy %>%
  mutate(reso = ifelse(Resolution == "NONE", 0, 1))
head(tidy_new2,10) 
##    IncidntNum       Category                                       Descript
## 1   120058272    WEAPON LAWS                      POSS OF PROHIBITED WEAPON
## 2   120058272    WEAPON LAWS FIREARM, LOADED, IN VEHICLE, POSSESSION OR USE
## 3   141059263       WARRANTS                                 WARRANT ARREST
## 4   160013662   NON-CRIMINAL                                  LOST PROPERTY
## 5   160002740   NON-CRIMINAL                                  LOST PROPERTY
## 6   160002869        ASSAULT                                        BATTERY
## 7   160003130 OTHER OFFENSES                               PAROLE VIOLATION
## 8   160003259   NON-CRIMINAL                                    FIRE REPORT
## 9   160003970       WARRANTS                                 WARRANT ARREST
## 10  160003641 MISSING PERSON                                   FOUND PERSON
##    DayOfWeek       Date       Time PdDistrict     Resolution
## 1     Friday 2016-01-29  11H 0M 0S   SOUTHERN ARREST, BOOKED
## 2     Friday 2016-01-29  11H 0M 0S   SOUTHERN ARREST, BOOKED
## 3     Monday 2016-04-25 14H 59M 0S    BAYVIEW ARREST, BOOKED
## 4    Tuesday 2016-01-05 23H 50M 0S TENDERLOIN           NONE
## 5     Friday 2016-01-01     30M 0S    MISSION           NONE
## 6     Friday 2016-01-01 21H 35M 0S   NORTHERN           NONE
## 7   Saturday 2016-01-02      4M 0S   SOUTHERN ARREST, BOOKED
## 8   Saturday 2016-01-02   1H 2M 0S TENDERLOIN           NONE
## 9   Saturday 2016-01-02 12H 21M 0S   SOUTHERN ARREST, BOOKED
## 10    Friday 2016-01-01  10H 6M 0S    BAYVIEW           NONE
##                    Address         X        Y
## 1   800 Block of BRYANT ST -122.4034 37.77542
## 2   800 Block of BRYANT ST -122.4034 37.77542
## 3    KEITH ST / SHAFTER AV -122.3889 37.72998
## 4   JONES ST / OFARRELL ST -122.4130 37.78579
## 5     16TH ST / MISSION ST -122.4197 37.76505
## 6    1700 Block of BUSH ST -122.4261 37.78802
## 7      MARY ST / HOWARD ST -122.4057 37.78088
## 8     200 Block of EDDY ST -122.4118 37.78398
## 9        4TH ST / BERRY ST -122.3934 37.77579
## 10 100 Block of CAMERON WY -122.3872 37.72097
##                                 Location hour Month reso
## 1   (37.775420706711, -122.403404791479)   11    01    1
## 2   (37.775420706711, -122.403404791479)   11    01    1
## 3  (37.7299809672996, -122.388856204292)   14    04    1
## 4  (37.7857883766888, -122.412970537591)   23    01    0
## 5  (37.7650501214668, -122.419671780296)    0    01    0
## 6   (37.788018555829, -122.426077177375)   21    01    0
## 7  (37.7808789360214, -122.405721454567)    0    01    1
## 8  (37.7839805592634, -122.411778295992)    1    01    0
## 9  (37.7757876218293, -122.393357241451)   12    01    1
## 10 (37.7209669615499, -122.387181635995)   10    01    0

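Since the outcome is binary, logistic regression (family = binomial()) would be the usual choice. Note, however, that the call below passes family = poisson(), so the model actually fitted is a Poisson regression with a log link, and its coefficients are on the log scale of the expected value of reso rather than log-odds. We use the broom::tidy function to show the resulting model.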

auto_fit2 <- glm(reso~DayOfWeek+hour, tidy_new2, family = poisson())
auto_fit_stats2 <- auto_fit2 %>%
  tidy() 
auto_fit_stats2 %>% knitr::kable()
| term               | estimate   | std.error | statistic   | p.value   |
|--------------------|------------|-----------|-------------|-----------|
| (Intercept)        | -1.2266639 | 0.0161291 | -76.0529152 | 0.0000000 |
| DayOfWeekMonday    | 0.0696434  | 0.0181777 | 3.8312433   | 0.0001275 |
| DayOfWeekSaturday  | 0.0038328  | 0.0181806 | 0.2108207   | 0.8330272 |
| DayOfWeekSunday    | 0.0547048  | 0.0183716 | 2.9776817   | 0.0029044 |
| DayOfWeekThursday  | 0.1082747  | 0.0178772 | 6.0565754   | 0.0000000 |
| DayOfWeekTuesday   | 0.1025907  | 0.0179313 | 5.7213216   | 0.0000000 |
| DayOfWeekWednesday | 0.1287907  | 0.0178007 | 7.2351493   | 0.0000000 |
| hour               | -0.0074859 | 0.0007434 | -10.0703299 | 0.0000000 |

summary(auto_fit2)
## 
## Call:
## glm(formula = reso ~ DayOfWeek + hour, family = poisson(), data = tidy_new2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8168  -0.7630  -0.7386   0.9847   1.1364  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -1.2266639  0.0161291 -76.053  < 2e-16 ***
## DayOfWeekMonday     0.0696434  0.0181777   3.831 0.000127 ***
## DayOfWeekSaturday   0.0038328  0.0181806   0.211 0.833027    
## DayOfWeekSunday     0.0547048  0.0183716   2.978 0.002904 ** 
## DayOfWeekThursday   0.1082747  0.0178772   6.057 1.39e-09 ***
## DayOfWeekTuesday    0.1025907  0.0179313   5.721 1.06e-08 ***
## DayOfWeekWednesday  0.1287907  0.0178007   7.235 4.65e-13 ***
## hour               -0.0074859  0.0007434 -10.070  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 107594  on 150499  degrees of freedom
## Residual deviance: 107399  on 150492  degrees of freedom
## AIC: 192855
## 
## Number of Fisher Scoring iterations: 5
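
The call above passes family = poisson(). For a 0/1 outcome such as reso, the more conventional choice is a logistic regression, which only requires family = binomial() instead. The sketch below shows the mechanics on synthetic data, not on the crime data: the sample size, the true coefficients and the names sim_hour, sim_reso and logit_fit are all assumptions made for the illustration.

```r
set.seed(320)

# Synthetic data: the probability of a resolved case declines with the hour
n <- 5000
sim_hour <- sample(0:23, n, replace = TRUE)
true_prob <- plogis(0.2 - 0.08 * sim_hour)  # assumed relationship
sim_reso <- rbinom(n, size = 1, prob = true_prob)

# Logistic regression: binomial family with the default logit link
logit_fit <- glm(sim_reso ~ sim_hour, family = binomial())
print(coef(summary(logit_fit)))

# exp(coefficient) is the multiplicative change in the odds per extra hour
odds_ratio <- exp(coef(logit_fit)[["sim_hour"]])
cat("Odds ratio per hour:", round(odds_ratio, 3), "\n")
```

On the tutorial data, the corresponding call would presumably be glm(reso ~ DayOfWeek + hour, data = tidy_new2, family = binomial()). Its coefficients would be log-odds and would differ from the Poisson estimates printed above.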

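To check whether this model is a good fit, we draw a scatter plot of the residuals against the fitted values. From the plot, we observe a non-random pattern, which suggests that this model is not the best choice for the data. Keep in mind that with a 0/1 outcome the residuals necessarily fall into two bands, one for each outcome value, so some structure in this plot is expected.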

auto_fit2 %>% 
  augment() %>%
  ggplot(aes(x=.fitted, y=.resid)) +
    geom_point()

For the second example, we want to find out whether there is a relationship between the hour and the number of incidents in that hour. First, we need to prepare our dataset: we add a new attribute hour_count that records how many incidents occurred in that hour.

tidy_new <- tidy %>%
  group_by(hour) %>%
  summarize(hour_count = n())

tidy_new <- left_join(tidy, tidy_new, by = "hour")
head(tidy_new, 10)
##    IncidntNum       Category                                       Descript
## 1   120058272    WEAPON LAWS                      POSS OF PROHIBITED WEAPON
## 2   120058272    WEAPON LAWS FIREARM, LOADED, IN VEHICLE, POSSESSION OR USE
## 3   141059263       WARRANTS                                 WARRANT ARREST
## 4   160013662   NON-CRIMINAL                                  LOST PROPERTY
## 5   160002740   NON-CRIMINAL                                  LOST PROPERTY
## 6   160002869        ASSAULT                                        BATTERY
## 7   160003130 OTHER OFFENSES                               PAROLE VIOLATION
## 8   160003259   NON-CRIMINAL                                    FIRE REPORT
## 9   160003970       WARRANTS                                 WARRANT ARREST
## 10  160003641 MISSING PERSON                                   FOUND PERSON
##    DayOfWeek       Date       Time PdDistrict     Resolution
## 1     Friday 2016-01-29  11H 0M 0S   SOUTHERN ARREST, BOOKED
## 2     Friday 2016-01-29  11H 0M 0S   SOUTHERN ARREST, BOOKED
## 3     Monday 2016-04-25 14H 59M 0S    BAYVIEW ARREST, BOOKED
## 4    Tuesday 2016-01-05 23H 50M 0S TENDERLOIN           NONE
## 5     Friday 2016-01-01     30M 0S    MISSION           NONE
## 6     Friday 2016-01-01 21H 35M 0S   NORTHERN           NONE
## 7   Saturday 2016-01-02      4M 0S   SOUTHERN ARREST, BOOKED
## 8   Saturday 2016-01-02   1H 2M 0S TENDERLOIN           NONE
## 9   Saturday 2016-01-02 12H 21M 0S   SOUTHERN ARREST, BOOKED
## 10    Friday 2016-01-01  10H 6M 0S    BAYVIEW           NONE
##                    Address         X        Y
## 1   800 Block of BRYANT ST -122.4034 37.77542
## 2   800 Block of BRYANT ST -122.4034 37.77542
## 3    KEITH ST / SHAFTER AV -122.3889 37.72998
## 4   JONES ST / OFARRELL ST -122.4130 37.78579
## 5     16TH ST / MISSION ST -122.4197 37.76505
## 6    1700 Block of BUSH ST -122.4261 37.78802
## 7      MARY ST / HOWARD ST -122.4057 37.78088
## 8     200 Block of EDDY ST -122.4118 37.78398
## 9        4TH ST / BERRY ST -122.3934 37.77579
## 10 100 Block of CAMERON WY -122.3872 37.72097
##                                 Location hour Month hour_count
## 1   (37.775420706711, -122.403404791479)   11    01       6786
## 2   (37.775420706711, -122.403404791479)   11    01       6786
## 3  (37.7299809672996, -122.388856204292)   14    04       7621
## 4  (37.7857883766888, -122.412970537591)   23    01       6573
## 5  (37.7650501214668, -122.419671780296)    0    01       6941
## 6   (37.788018555829, -122.426077177375)   21    01       7480
## 7  (37.7808789360214, -122.405721454567)    0    01       6941
## 8  (37.7839805592634, -122.411778295992)    1    01       4359
## 9  (37.7757876218293, -122.393357241451)   12    01       9021
## 10 (37.7209669615499, -122.387181635995)   10    01       6483

Then, we can build a regression tree model for the number of incidents by hour. A regression tree is not a linear regression model. Prediction trees use a tree to represent a recursive partition of the predictor space. Each of the terminal nodes, or leaves, of the tree represents a cell of the partition and has attached to it a simple model that applies in that cell only. A point x belongs to a leaf if x falls in the corresponding cell of the partition. To figure out which cell we are in, we start at the root node of the tree and ask a sequence of questions about the features.

auto_fit <- tree(hour_count~hour, tidy_new)
plot(auto_fit)
text(auto_fit, pretty=0)
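
To make the idea of a recursive partition more concrete, the sketch below finds the single best first split on hour by hand. It tries every threshold between consecutive hours and keeps the one that minimizes the total squared error around the two cell means. This is a simplified illustration, not the algorithm as implemented in the tree package: it uses one row per hour (the counts from the table in the first part), whereas tidy_new has one row per incident, so the splits chosen by tree() may differ. The names sse, thresholds, split_sse and best_split are ours.

```r
# One row per hour: counts copied from table(tidy$hour)
hour <- 0:23
hour_count <- c(6941, 4359, 3494, 2553, 1885, 1744, 2518, 3894,
                5575, 5865, 6483, 6786, 9021, 7268, 7621, 8329,
                8656, 9559, 9718, 8981, 8098, 7480, 7099, 6573)

# Sum of squared errors around the mean of a cell
sse <- function(y) sum((y - mean(y))^2)

# Candidate thresholds halfway between consecutive hours: 0.5, 1.5, ..., 22.5
thresholds <- hour[-length(hour)] + 0.5

# Total error if we split into "hour < t" and "hour >= t"
split_sse <- sapply(thresholds, function(t) {
  sse(hour_count[hour < t]) + sse(hour_count[hour >= t])
})

best_split <- thresholds[which.min(split_sse)]
cat("Best first split: hour <", best_split, "\n")
cat("Cell means:",
    round(mean(hour_count[hour < best_split])), "and",
    round(mean(hour_count[hour >= best_split])), "\n")
```

A regression tree repeats this search inside each resulting cell until a stopping rule is met, and predicts the cell mean for every point that lands in a leaf.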

Conclusion

In this tutorial, we covered data preparation and tidying, visualization of attributes, the use of interactive maps, and prediction and regression analysis. Throughout the analysis, we explored the trends and distributions of the attributes in the dataset to help you build a general intuition for analyzing data with data science methods. In addition, we summarized the overall crime situation in San Francisco in 2016 and, based on our analysis, offered several suggestions to the authorities.

Resources

Our tutorial covers only some of the methods and functions in these libraries. We encourage you to take a look at these online resources for further learning and help: